97 research outputs found

    Algorithms for weighted multidimensional search and perfect phylogeny

    This dissertation is a collection of papers from two independent areas: convex optimization problems in R^d and the construction of evolutionary trees. The paper on convex optimization problems in R^d gives improved algorithms for solving the Lagrangian duals of problems that have both of the following properties. First, in the absence of the bad constraints, the problems can be solved in strongly polynomial time by combinatorial algorithms. Second, the number of bad constraints is fixed. As part of our solution to these problems, we extend Cole's circuit simulation approach and develop a weighted version of Megiddo's multidimensional search technique. The papers on evolutionary tree construction deal with the perfect phylogeny problem, where species are specified by a set of characters and each character can occur in a species in one of a fixed number of states. This problem is known to be NP-complete. The dissertation contains the following results on the perfect phylogeny problem: (1) a linear time algorithm when all the characters have two states; (2) a polynomial time algorithm when the number of character states is fixed; (3) a polynomial time algorithm when the number of characters is fixed.

    Application of dissociation curve analysis to radiation hybrid panel marker scoring: generation of a map of river buffalo (B. bubalis) chromosome 20

    Background: Fluorescence of dyes bound to double-stranded PCR products has been utilized extensively in various real-time quantitative PCR applications, including post-amplification dissociation curve analysis, or differentiation of amplicon length or sequence composition. Despite the current era of whole-genome sequencing, mapping tools such as radiation hybrid DNA panels remain useful aids for sequence assembly, focused resequencing efforts, and for building physical maps of species that have not yet been sequenced. For placement of specific, individual genes or markers on a map, low-throughput methods remain commonplace. Typically, PCR amplification of DNA from each panel cell line is followed by gel electrophoresis and scoring of each clone for the presence or absence of PCR product. To improve sensitivity and efficiency of radiation hybrid panel analysis in comparison to gel-based methods, we adapted fluorescence-based real-time PCR and dissociation curve analysis for use as a novel scoring method. Results: As proof of principle for this dissociation curve method, we generated new maps of river buffalo (Bubalus bubalis) chromosome 20 by both dissociation curve analysis and conventional marker scoring. We also obtained sequence data to augment dissociation curve results. Few genes have been previously mapped to buffalo chromosome 20, and sequence detail is limited, so 65 markers were screened from the orthologous chromosome of domestic cattle. Thirty bovine markers (46%) were suitable as cross-species markers for dissociation curve analysis in the buffalo radiation hybrid panel under a standard protocol, compared to 25 markers suitable for conventional typing. Computational analysis placed 27 markers on a chromosome map generated by the new method, while the gel-based approach produced only 20 mapped markers. Among 19 markers common to both maps, the marker order on the map was maintained perfectly. Conclusion: Dissociation curve analysis is reliable and efficient for radiation hybrid panel scoring, and is more sensitive and robust than conventional gel-based typing methods. Several markers could be scored only by the new method, and ambiguous scores were reduced. PCR-based dissociation curve analysis decreases both time and resources needed for construction of radiation hybrid panel marker maps and represents a significant improvement over gel-based methods in any species.

    Novel Gene Acquisition on Carnivore Y Chromosomes

    Despite its importance in harboring genes critical for spermatogenesis and male-specific functions, the Y chromosome has been largely excluded as a priority in recent mammalian genome sequencing projects. Only the human and chimpanzee Y chromosomes have been well characterized at the sequence level. This is primarily due to the presumed low overall gene content and highly repetitive nature of the Y chromosome and the ensuing difficulties using a shotgun sequence approach for assembly. Here we used direct cDNA selection to isolate and evaluate the extent of novel Y chromosome gene acquisition in the genome of the domestic cat, a species from a different mammalian superorder than human, chimpanzee, and mouse (currently being sequenced). We discovered four novel Y chromosome genes that do not have functional copies in the finished human male-specific region of the Y or on other mammalian Y chromosomes explored thus far. Two genes are derived from putative autosomal progenitors, and the other two have X chromosome homologs from different evolutionary strata. All four genes were shown to be multicopy and expressed predominantly or exclusively in testes, suggesting that their duplication and specialization for testis function were selected for because they enhance spermatogenesis. Two of these genes have testis-expressed, Y-borne copies in the dog genome as well. The absence of the four newly described genes on other characterized mammalian Y chromosomes demonstrates the gene novelty on this chromosome between mammalian orders, suggesting it harbors many lineage-specific genes that may go undetected by traditional comparative genomic approaches. Specific plans to identify the male-specific genes encoded in the Y chromosome of mammals should be a priority.

    Database indexing for production MegaBLAST searches

    Motivation: The BLAST software package for sequence comparison speeds up homology search by preprocessing a query sequence into a lookup table. Numerous research studies have suggested that preprocessing the database instead would give better performance. However, production usage of sequence comparison methods that preprocess the database has been limited to programs such as BLAT and SSAHA that are designed to find matches when query and database subsequences are highly similar.
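The contrast between preprocessing the query and preprocessing the database can be illustrated with a minimal sketch. It assumes exact k-mer seeds only; the function names `build_kmer_index` and `find_seed_hits` are hypothetical and are not taken from BLAST, MegaBLAST, BLAT, or SSAHA, whose actual seeding and extension logic is considerably more involved.

```python
from collections import defaultdict


def build_kmer_index(sequence, k):
    """Map every length-k substring of `sequence` to its start offsets."""
    index = defaultdict(list)
    for i in range(len(sequence) - k + 1):
        index[sequence[i:i + k]].append(i)
    return index


def find_seed_hits(indexed, scanned, k):
    """Return (indexed_offset, scanned_offset) pairs of exact k-mer matches.

    The lookup table is built from `indexed`; `scanned` is walked once.
    Passing the query as `indexed` mimics query preprocessing, while
    passing the database as `indexed` mimics database preprocessing.
    """
    index = build_kmer_index(indexed, k)
    hits = []
    for j in range(len(scanned) - k + 1):
        # .get avoids inserting empty entries into the defaultdict
        for i in index.get(scanned[j:j + k], ()):
            hits.append((i, j))
    return hits


if __name__ == "__main__":
    query = "ACGTAC"
    database = "TTACGT"
    # Query-indexed: table built per search, database scanned each time.
    print(find_seed_hits(query, database, 3))
    # Database-indexed: table could be built once and reused across queries.
    print(find_seed_hits(database, query, 3))
```

In this toy form both calls find the same matches with the offsets swapped; the practical difference is which table can be built ahead of time and reused.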

    Single haplotype assembly of the human genome from a hydatidiform mole

    A complete reference assembly is essential for accurately interpreting individual genomes and associating variation with phenotypes. While the current human reference genome sequence is of very high quality, gaps and misassemblies remain due to biological and technical complexities. Large repetitive sequences and complex allelic diversity are the two main drivers of assembly error. Although increasing the length of sequence reads and library fragments can improve assembly, even the longest available reads do not resolve all regions. In order to overcome the issue of allelic diversity, we used genomic DNA from an essentially haploid hydatidiform mole, CHM1. We utilized several resources from this DNA including a set of end-sequenced and indexed BAC clones and 100× Illumina whole-genome shotgun (WGS) sequence coverage. We used the WGS sequence and the GRCh37 reference assembly to create an assembly of the CHM1 genome. We subsequently incorporated 382 finished BAC clone sequences to generate a draft assembly, CHM1_1.1 (NCBI AssemblyDB GCA_000306695.2). Analysis of gene, repetitive element, and segmental duplication content shows this assembly to be of excellent quality and contiguity. However, comparison to assembly-independent resources, such as BAC clone end sequences and PacBio long reads, indicates misassembled regions. Most of these regions are enriched for structural variation and segmental duplication, and can be resolved in the future. This publicly available assembly will be integrated into the Genome Reference Consortium curation framework for further improvement, with the ultimate goal being a completely finished gap-free assembly.

    High-throughput sequencing of the DBA/2J mouse genome

    The DBA/2J mouse is not only the oldest inbred strain, but also one of the most widely used strains. DBA/2J exhibits many unique anatomical, physiological, and behavioral traits. In addition, DBA/2J is one parent of the large BXD family of recombinant inbred strains [1]. The genome of the other parent of this BXD family, C57BL/6J, has been sequenced and serves as the mouse reference genome [2]. We sequenced the genome of DBA/2J using SOLiD and Illumina high-throughput short-read protocols to generate a comprehensive set of ~5 million sequence variants segregating in the BXD family that ultimately cause developmental, anatomical, functional and behavioral differences among these 80+ strains.

    PedHunter 2.0 and its usage to characterize the founder structure of the Old Order Amish of Lancaster County

    Background: Because they are a closed founder population, the Old Order Amish (OOA) of Lancaster County have been the subject of many medical genetics studies. We constructed four versions of Anabaptist Genealogy Database (AGDB) using three sources of genealogies and multiple updates. In addition, we developed PedHunter, a suite of query software that can solve pedigree-related problems automatically and systematically. Methods: We report on how we have used new features in PedHunter to quantify the number and expected genetic contribution of founders to the OOA. The queries and utility of PedHunter programs are illustrated by examples using AGDB in this paper. For example, we calculated the number of founders expected to be contributing genetic material to the present-day living OOA and estimated the mean relative founder representation for each founder. New features in PedHunter also include pedigree trimming and pedigree renumbering, which should prove useful for studying large pedigrees. Results: With PedHunter version 2.0 querying AGDB version 4.0, we identified 34,160 presumed living OOA individuals and connected them into a 14-generation pedigree descending from 554 founders (332 females and 222 males) after trimming. From the analysis of cumulative mean relative founder representation, 128 founders (78 females and 50 males) accounted for over 95% of the mean relative founder contribution among living OOA descendants. Discussion/Conclusions: The OOA are a closed founder population in which a modest number of founders account for the genetic variation present in the current OOA population. Improvements to the PedHunter software will be useful in future studies of both the OOA and other populations with large and computerized genealogies.

    The National Center for Biotechnology Information's Protein Clusters Database

    Rapid increases in DNA sequencing capabilities have led to a vast increase in the data generated from prokaryotic genomic studies, which has been a boon to scientists studying micro-organism evolution and to those who wish to understand the biological underpinnings of microbial systems. The NCBI Protein Clusters Database (ProtClustDB) has been created to efficiently maintain and keep the deluge of data up to date. ProtClustDB contains both curated and uncurated clusters of proteins grouped by sequence similarity. The May 2008 release contains a total of 285 386 clusters derived from over 1.7 million proteins encoded by 3806 nt sequences from the RefSeq collection of complete chromosomes and plasmids from four major groups: prokaryotes, bacteriophages and the mitochondrial and chloroplast organelles. There are 7180 clusters containing 376 513 proteins with curated gene and protein functional annotation. PubMed identifiers and external cross references are collected for all clusters and provide additional information resources. A suite of web tools is available to explore more detailed information, such as multiple alignments, phylogenetic trees and genomic neighborhoods. ProtClustDB provides an efficient method to aggregate gene and protein annotation for researchers and is available at http://www.ncbi.nlm.nih.gov/sites/entrez?db=proteinclusters

    Retrieval accuracy, statistical significance and compositional similarity in protein sequence database searches

    Protein sequence database search programs may be evaluated both for their retrieval accuracy (the ability to separate meaningful from chance similarities) and for the accuracy of their statistical assessments of reported alignments. However, methods for improving statistical accuracy can degrade retrieval accuracy by discarding compositional evidence of sequence relatedness. This evidence may be preserved by combining essentially independent measures of alignment and compositional similarity into a unified measure of sequence similarity. A version of the BLAST protein database search program, modified to employ this new measure, outperforms the baseline program in both retrieval and statistical accuracy on ASTRAL, a SCOP-based test set.